
Internet Archive Bibliographic Metadata

Internet Archive Web Group

This collection contains both external ("upstream") metadata dumps and Internet Archive generated databases and reports on our holdings of papers, books, and other documents.

71 results
Internet Archive Bibliographic Metadata
data
eye 443
favorite 1
comment 0
This file is a snapshot dump of the Crossref DOI metadata API, containing entries for over 94 million DOIs. Compared to the previous 2017-03 version (see archive.org item "crossref_doi_dump_201703"), this snapshot has a few million more works, but the corpus size is much larger (29 GB compressed vs. 7 GB compressed) as it now contains significantly more citation data, due to the efforts of the Initiative for Open Citations (I4OC) project. This was generated by running the scripts...
Internet Archive Bibliographic Metadata
by Sci-Hub
data
eye 223
favorite 0
comment 0
On 2017-03-19, the Twitter user @Sci_Hub posted a list of 62,835,101 DOIs contained in Sci-Hub: https://twitter.com/Sci_Hub/status/843546352219017218 This item contains a copy of the list. This item contains no PDFs, papers, fulltext, or other copyrighted content. Important note: not all DOIs in this list are valid (that is, some do not resolve via doi.org).
Internet Archive Bibliographic Metadata
by Microsoft Academic Search
data
eye 222
favorite 0
comment 0
This is a copy of the Microsoft Academic Graph corpus of scholarly publications and citations, based on crawls from the open web. Metadata (authors, DOI numbers, journals, citations, keywords, affiliations, etc) is included for more than 125 million publications. The corpus is a single 27GB zipfile that extracts into about 96GB of flat tab-separated text files, cross-referenced using identifier columns. Schema information can be found in the `readme.txt` file, and usage restrictions can be...
Internet Archive Bibliographic Metadata
data
eye 204
favorite 0
comment 0
A snapshot of the oaDOI DOI/URL database, including open access status for each paper. oaDOI is the API backing unpaywall; see oadoi.org for more details. This dataset is intended for NON-COMMERCIAL USE ONLY; contact oaDOI for details or commercial support.
Internet Archive Bibliographic Metadata
data
eye 172
favorite 0
comment 0
This file is a snapshot dump of the Crossref DOI metadata API, containing entries for over 99 million DOIs. This was generated by running the scripts at: https://github.com/greenelab/crossref (git commit: 768a49ba1d8ba1971f00471950514716a9f699c8) The script completed on 2018-09-20. Format is xz-compressed JSON (one JSON object per line).
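A dump in this format (xz-compressed, one JSON object per line) can be read as a stream without decompressing it to disk first. The following is a minimal sketch, not the script used to build the item: the helper names are hypothetical, the sample records are made up for the demo, and the "DOI" key is assumed from the Crossref API record layout.

```python
import io
import json
import lzma


def iter_json_lines(stream):
    """Yield one parsed object per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_xz_json(path):
    """Stream records from an xz-compressed JSON-lines file at `path`."""
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        yield from iter_json_lines(handle)


# Demo on a tiny in-memory sample (made-up records, not taken from the dump).
sample = b'{"DOI": "10.1000/example-1"}\n{"DOI": "10.1000/example-2"}\n'
compressed = lzma.compress(sample)
with lzma.open(io.BytesIO(compressed), "rt", encoding="utf-8") as handle:
    records = list(iter_json_lines(handle))
print([record["DOI"] for record in records])
```

For the real file, `iter_xz_json` would be called with the local path of the downloaded dump. Iterating lazily keeps memory use flat, which matters for a file of this size.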
Internet Archive Bibliographic Metadata
data
eye 146
favorite 0
comment 0
Manifest of Internet Archive's identified scholarly works in digital form (eg, journal articles). See README.html for details.
Internet Archive Bibliographic Metadata
data
eye 129
favorite 0
comment 0
Manifest of Internet Archive's identified scholarly works in digital form (eg, journal articles). See README.html for details.
Internet Archive Bibliographic Metadata
by CiteSeerX Group at PSU
data
eye 120
favorite 0
comment 0
This is a mirror of a CiteSeerX database dump, downloaded from S3. It's hosted here for easy Internet Archive analytics access, and so we don't need to re-pay S3 download fees. See also: http://csxstatic.ist.psu.edu/about/data
Internet Archive Bibliographic Metadata
by aminer.org
data
eye 112
favorite 0
comment 0
A copy of the "Open Academic Graph" corpus published by aminer.org and Microsoft Academic Graph in Summer 2017. Contains almost 120 GB (compressed) of bibliographic metadata for hundreds of millions of publications. Related publications include: Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining...
Internet Archive Bibliographic Metadata
by ROAD: Directory of Open Access Scholarly Resources
data
eye 82
favorite 0
comment 0
This is a backup of ROAD/ISSN metadata from http://road.issn.org/en/contenu/download-road-records Dumps in both MARC XML and RDF format are included; see sub-directory for date of download. See also earlier July 2017 dump at: https://archive.org/download/road-issn-2017 These files are under the Creative Commons Attribution-NonCommercial 4.0 International Public License (aka, CC-BY-NC).
Topic: metadata
Internet Archive Bibliographic Metadata
by ORCID, Inc
data
eye 76
favorite 0
comment 0
This item contains an annual copy of the ORCID public data file, as originally downloaded from: https://orcid.org/content/download-file More details about this content and its use are available at: https://orcid.org/content/orcid-public-data-file This dataset is available under the public domain (CC-0). The DOI of this dataset is: https://doi.org/10.6084/m9.figshare.5479792
Internet Archive Bibliographic Metadata
by Directory of Open Access Journals
data
eye 66
favorite 0
comment 0
Downloaded from https://doaj.org/csv and the OAI-PMH interface. File names encode the date when data was downloaded.
Downloaded from https://core.ac.uk/services "The data aggregated from repositories by the CORE system can be accessed in two ways, through the CORE API or by downloading the data to your computer. The former option is practical if you want to build a service on top of CORE while the latter is something we recommend to those who would like to analyse the CORE dataset and/or apply some computationally intensive batch processes. If you use CORE in your work, we kindly request you to cite one...
Manifest of Internet Archive's identified scholarly works in digital form (eg, journal articles). See README.html for details.
Manifest of Internet Archive's identified scholarly works in digital form (eg, journal articles). See README.html for details.
Internet Archive Bibliographic Metadata
by Internet Archive Web Group
data
eye 59
favorite 0
comment 0
Data-munged title-level metadata combined from: DOAJ, ROAD, Norwegian Register, and Internet Archive crawled metadata. See SOURCES.md for URLs of upstream metadata, and ISSN_matching.html for Jupyter notebook used to derive this dataset.
Internet Archive Bibliographic Metadata
by ROAD: Directory of Open Access Scholarly Resources
data
eye 58
favorite 0
comment 0
This is a backup of ROAD/ISSN metadata, downloaded July 3rd, 2017 from http://road.issn.org/en/contenu/download-road-records Dumps in both MARC XML and RDF format are included. These files are under the Creative Commons Attribution-NonCommercial 4.0 International Public License (aka, CC-BY-NC).
Topic: metadata
Internet Archive Bibliographic Metadata
by ISSN
data
eye 55
favorite 0
comment 0
Unlike most ISSN metadata, this mapping file is publicly available.
Internet Archive Bibliographic Metadata
by Internet Archive Web Group
data
eye 41
favorite 0
comment 0
Internet Archive Bibliographic Metadata
data
eye 40
favorite 0
comment 0
'crossref-works.json.xz' is the original file. 'works_crossref.elasticsearch.json.gz' contains a subset of metadata for most (but not all) works, restructured to be loaded directly into an Elasticsearch index. DOI: 10.6084/m9.figshare.4816720.v1 Via: https://figshare.com/articles/Metadata_for_all_DOIs_in_Crossref_JSON_MongoDB_exports_of_all_works_from_the_Crossref_API/4816720
Internet Archive Bibliographic Metadata
by Allen Institute for Artificial Intelligence
data
eye 33
favorite 0
comment 0
This is a snapshot of the AI2 (Semantic Scholar) "Open Research Corpus", as downloaded June 26th, 2017. These files were originally downloaded from: http://labs.semanticscholar.org/corpus/ Note restrictions in the 'license.txt' file. 'index.html' is a backup of the landing page, which includes field content. 'papers-2017-02-21-sample.zip' is a subset of the data useful for exploration. Semantic Scholar is a project of the Allen Institute for Artificial Intelligence.
Internet Archive Bibliographic Metadata
by Datacite
data
eye 33
favorite 0
comment 0
This item contains snapshots of the Datacite OAI-PMH metadata feed, as captured with the tool 'metha'.
Downloaded from https://core.ac.uk/services "The data aggregated from repositories by the CORE system can be accessed in two ways, through the CORE API or by downloading the data to your computer. The former option is practical if you want to build a service on top of CORE while the latter is something we recommend to those who would like to analyse the CORE dataset and/or apply some computationally intensive batch processes. If you use CORE in your work, we kindly request you to cite one...
Internet Archive Bibliographic Metadata
by Sci-Hub
data
eye 31
favorite 0
comment 0
This item contains a dump of download statistics as downloaded from Sci-Hub (see original_urls.txt) in March, 2018.
Internet Archive Bibliographic Metadata
by Directory of Open Access Journals
data
eye 31
favorite 0
comment 0
Downloaded from https://doaj.org/csv and the OAI-PMH interface.
Internet Archive Bibliographic Metadata
by Allen Institute for Artificial Intelligence
data
eye 28
favorite 0
comment 0
This is a snapshot of the AI2 (Semantic Scholar) "Open Research Corpus", as released May 3rd, 2018. These files were originally downloaded from AWS S3, via: http://labs.semanticscholar.org/corpus/ Note restrictions in the 'license.txt' file. 'index.html' is a backup of the landing page, which includes field content. 'sample-S2-records.gz' is a subset of the data useful for exploration. Semantic Scholar is a project of the Allen Institute for Artificial Intelligence.
Internet Archive Bibliographic Metadata
by ORCID, Inc
data
eye 25
favorite 0
comment 0
This item contains an annual copy of the ORCID public data file, as originally downloaded from: https://orcid.org/content/download-file More details about this content and its use are available at: https://orcid.org/content/orcid-public-data-file This dataset is available under the public domain (CC-0). The DOI of this dataset is: https://doi.org/10.6084/m9.figshare.4134027
Internet Archive Bibliographic Metadata
by OCLC
data
eye 21
favorite 0
comment 0
This is a copy of the VIAF ("Virtual International Authority File") as downloaded from OCLC on 2018-03-07. Download urls are in the original_urls.txt text file. See also: https://viaf.org/viaf/data/
Internet Archive Bibliographic Metadata
by Internet Archive Web Group
data
eye 21
favorite 0
comment 0
This is a mapping between: - DOIs (Crossref) - PubMed PMID and PMCID (NIH) - CORE record identifier (core.ac.uk) - Wikidata QIDs See README and scripts for details.
Downloaded from https://core.ac.uk/services "The data aggregated from repositories by the CORE system can be accessed in two ways, through the CORE API or by downloading the data to your computer. The former option is practical if you want to build a service on top of CORE while the latter is something we recommend to those who would like to analyse the CORE dataset and/or apply some computationally intensive batch processes. If you use CORE in your work, we kindly request you to cite one...
Internet Archive Bibliographic Metadata
by Wikidata Project
data
eye 20
favorite 0
comment 0
This item contains a copy of the 2018-09-03 snapshot of bibliographic metadata extracted from Wikidata. These datasets were downloaded from: http://uri.gbv.de/wikicite/20180903/ More information at: https://github.com/wikicite/wikicite-data#readme and http://wikicite.org/
Standard paper bibliographic metadata corpora (eg, Crossref, PubMed, arXiv) transformed into simple tab-separated and JSON formats.
Internet Archive Bibliographic Metadata
by Impactstory
data
eye 19
favorite 0
comment 0
A mirror of the Unpaywall (aka oaDOI.org) metadata corpus, primarily consisting of public open access flags for a large number of Crossref-registered DOIs (identifiers representing published journal articles and other works). For more information see: http://unpaywall.org/products/snapshot
Internet Archive Bibliographic Metadata
by Allen Institute for Artificial Intelligence
data
eye 18
favorite 0
comment 0
This is a snapshot of the AI2 (Semantic Scholar) "Open Research Corpus". These files were originally downloaded from: http://labs.semanticscholar.org/corpus/ Note restrictions in the 'license.txt' file. 'index.html' is a backup of the landing page, which includes field content. 'papers-*-sample.zip' is a subset of the data useful for exploration. Semantic Scholar is a project of the Allen Institute for Artificial Intelligence.
Internet Archive Bibliographic Metadata
by Crossref
data
eye 17
favorite 0
comment 1
Metadata from the Crossref DOI registrar about "titles" (aka, individual Journals), in CSV format. Originally fetched from: https://wwwold.crossref.org/titlelist/titleFile.csv
Internet Archive Bibliographic Metadata
data
eye 17
favorite 0
comment 0
This item contains work-level metadata about papers on academia.edu, obtained through their OAI-PMH interface.
Internet Archive Bibliographic Metadata
by ORCID, Inc
data
eye 16
favorite 0
comment 0
This item contains an annual copy of the ORCID public data file, as originally downloaded from: https://orcid.org/content/download-file More details about this content and its use are available at: https://orcid.org/content/orcid-public-data-file This dataset is available under the public domain (CC-0). The DOI of this dataset is: https://doi.org/10.14454/07243.2014.001
Internet Archive Bibliographic Metadata
by ORCID, Inc
data
eye 13
favorite 0
comment 0
This item contains an annual copy of the ORCID public data file, as originally downloaded from: https://orcid.org/content/download-file More details about this content and its use are available at: https://orcid.org/content/orcid-public-data-file This dataset is available under the public domain (CC-0). The DOI of this dataset is: https://doi.org/10.6084/m9.figshare.1582705
Internet Archive Bibliographic Metadata
by Japan Link Center
data
eye 13
favorite 0
comment 0
Downloaded from http://japanlinkcenter.org/top/material/material_metadata.html
Internet Archive Bibliographic Metadata
by ORCID, Inc
data
eye 12
favorite 0
comment 0
This item contains an annual copy of the ORCID public data file, as originally downloaded from: https://orcid.org/content/download-file More details about this content and its use are available at: https://orcid.org/content/orcid-public-data-file This dataset is available under the public domain (CC-0). The DOI of this dataset is: https://doi.org/10.14454/07243.2013.001
Internet Archive Bibliographic Metadata
data
eye 12
favorite 0
comment 0
This item contains a set of "Keeper's Reports" summarizing journal content preservation coverage from major archival services and networks (Portico, LOCKSS, CLOCKSS). See README for links to where these files were downloaded from.
Topics: Keeper's Reports, Metadata, Preservation
Internet Archive Bibliographic Metadata
data
eye 12
favorite 0
comment 0
As downloaded from: https://www.jstor.org/dfr/about/sample-datasets "The Early Journal Content (EJC) on JSTOR includes public domain journal articles published in the United States before 1923 and articles published in other countries before 1870, and includes discourse and scholarship in the arts and humanities, economics and politics, and in mathematics and other sciences. The EJC dataset includes full-text OCR and article-level metadata."
Internet Archive Bibliographic Metadata
by EuropePMC
data
eye 11
favorite 0
comment 0
Data mirrored from https://europepmc.org/downloads Contains a mapping between PubMed IDs (PMID), PubMedCentral IDs (PMCID), and DOI numbers, for over 29 million works.
Internet Archive Bibliographic Metadata
by Directory of Open Access Journals
data
eye 10
favorite 0
comment 0
Downloaded from https://doaj.org/csv and the OAI-PMH interface. File names encode the date when data was downloaded.
Internet Archive Bibliographic Metadata
data
eye 10
favorite 0
comment 0
Internet Archive Bibliographic Metadata
by CORE
data
eye 10
favorite 0
comment 0
This item contains mappings between CORE (https://core.ac.uk/) internal identifiers (simple integer numbers) and DOIs. This listing (a simple two-column TSV file) is derived from their publicly available metadata corpus.
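A two-column TSV of this kind can be loaded into a lookup table in a few lines. The sketch below rests on assumptions not stated in the item: that the CORE identifier is the first column and the DOI the second, that malformed rows should be skipped, and that DOIs should be lowercased for matching (DOIs are case-insensitive). The function name and sample rows are hypothetical.

```python
import csv
import io


def load_core_doi_map(stream):
    """Build {core_id: doi} from a two-column, tab-separated text stream."""
    mapping = {}
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip rows that do not have exactly two columns
        core_id, doi = row
        try:
            key = int(core_id)  # assumed: first column is the integer CORE id
        except ValueError:
            continue  # skip header rows or non-numeric identifiers
        mapping[key] = doi.strip().lower()
    return mapping


# Demo with made-up rows; the real file would be opened with open(path, newline="").
sample = io.StringIO("101\t10.1000/ABC\n102\t10.1000/xyz\nbad line\n")
core_to_doi = load_core_doi_map(sample)
print(core_to_doi)
```

A plain dictionary is enough for spot lookups on a sample; for the full corpus a database table or an on-disk key-value store may be a better fit.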
Internet Archive Bibliographic Metadata
by Internet Archive Web Group
data
eye 9
favorite 0
comment 0
This item contains a complete PostgreSQL SQL database snapshot from https://fatcat.wiki, in binary 'pg_dump tar mode' format. With the exception of the 'abstracts' table (for which no aggregate license or copyright claims can be made; downstream users are responsible for their use), all metadata here is licensed CC-0 (public domain release) and may be used for any purpose. Downstream users are strongly encouraged to provide attribution and link here to the snapshot, as well as give credit to...
Internet Archive Bibliographic Metadata
by aminer.org
data
eye 8
favorite 0
comment 0
A copy of the "Open Academic Graph v2" (OAGv2) corpus published by aminer.org and Microsoft Academic Graph in early 2019. Contains roughly 90 GB (compressed) of bibliographic metadata for hundreds of millions of publications. Related publications include: Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data...
Internet Archive Bibliographic Metadata
by Internet Archive Web Group
data
eye 8
favorite 0
comment 0
Contains a TSV file with SHA1, file size, wayback URLs, and metadata extracted from PDFs by GROBID. Not intended for external use, but might be of interest. DOES NOT CONTAIN FULLTEXT CONTENT.
Internet Archive Bibliographic Metadata
by Impactstory
data
eye 8
favorite 0
comment 0
A mirror of the Unpaywall (aka oaDOI.org) metadata corpus, primarily consisting of public open access flags for a large number of Crossref-registered DOIs (identifiers representing published journal articles and other works). For more information see: http://unpaywall.org/products/snapshot
This item contains a transformed copy (single gzip'd JSON-per-line file, instead of tarball of xz-zipped JSON per-source files) of the metadata in item https://archive.org/details/core_oa_metadata_20180301. All the same licenses and caveats apply.
Standard paper bibliographic metadata corpora (eg, Crossref, PubMed, arXiv) transformed into simple tab-separated and JSON formats.
Internet Archive Bibliographic Metadata
by Norwegian Centre for Research Data
data
eye 7
favorite 0
comment 0
This item contains a snapshot of the "Norwegian Register for Scientific Journals, Series and Publishers", as downloaded from https://dbh.nsd.uib.no/publiseringskanaler/AlltidFerskListe. As the name indicates, this is a registry of international Journals (aka "titles", or "serials"); the scope is not limited to Norwegian or Nordic publications.
Internet Archive Bibliographic Metadata
by Internet Archive Web Group
data
eye 7
favorite 0
comment 0
This is a derivative of https://archive.org/download/ia_papers_manifest_2018-01-25, which contains JSON objects that can be inserted into a fatcat catalog.
Internet Archive Bibliographic Metadata
by Internet Archive Web Group
data
eye 6
favorite 0
comment 0
Test runs of large-scale matching algorithms (sha1 to DOI). Will likely be obsolete soon, and not useful for others.
Internet Archive Bibliographic Metadata
data
eye 5
favorite 0
comment 0
Downloaded from: https://zenodo.org/record/1438356
Copy of the MEDLINE 2017 Baseline of PubMed metadata, provided by the US National Library of Medicine (NLM).
Internet Archive Bibliographic Metadata
by Wikimedia Research
data
eye 3
favorite 0
comment 0
Contains (at least) a list of DOIs cited by various language Wikipedias as of March 2018. Transformed by Charles using lists linked from https://blog.wikimedia.org/2018/04/05/ten-most-cited-sources-wikipedia/
Internet Archive Bibliographic Metadata
by Silvio Peroni
texts
eye 3
favorite 1
comment 0
This dataset contains all the citation data included in the triplestore of COCI archived on the 3rd of October 2018.
Topics: open citations, OpenCitations, COCI, RDF, triplestore, I4OC, open data, CC0
Internet Archive Bibliographic Metadata
by Jan Szczepanski
data
eye 2
favorite 0
comment 0
Downloaded from: https://www.ebsco.com/sites/g/files/nabnos191/files/acquiadam-assets/Jan-Szczepanski-Open-Access-Journals-2018_0.docx
Internet Archive Bibliographic Metadata
data
eye 2
favorite 0
comment 0
This is the 2019 "baseline" PubMed/MEDLINE bibliographic metadata corpus, originally published in December 2018. Downloaded from https://www.nlm.nih.gov/databases/download/pubmed_medline.html
Internet Archive Bibliographic Metadata
by Internet Archive Web Group
data
eye 1
favorite 0
comment 0
Snapshot of Internet Archive (petabox) file-level metadata (eg, PDF hashes) for files under the 'journals' collection as of December 2018. Note: includes a small number of items not actually under the 'journals' collection hierarchy due to how the input item list was generated, and a small fraction (estimate 500?) of items didn't dump successfully. A bit sloppy!
Internet Archive Bibliographic Metadata
by Internet Archive Web Group
data
eye 1
favorite 0
comment 0
This item contains hash lists of PDF files crawled from the public web specifically to preserve the scholarly record. It does not contain hashes of *all* PDFs the archive has ever seen, only a subset. Not all of these hashes are necessarily journal articles or other research outputs, but we have reason to believe the large majority are.
Internet Archive Bibliographic Metadata
by NCBI
data
eye 1
favorite 0
comment 0
Downloaded from: ftp://ftp.ncbi.nlm.nih.gov/pubmed/J_Entrez.txt
Internet Archive Bibliographic Metadata
data
eye 1
favorite 0
comment 0
This item contains a set of "Keeper's Reports" summarizing journal content preservation coverage from major archival services and networks (Portico, LOCKSS, CLOCKSS). See README for links to where these files were downloaded from.
Internet Archive Bibliographic Metadata
data
eye 1
favorite 0
comment 0
This item contains snapshots of the PubMed Central OA subset file manifests, linked from https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist
Internet Archive Bibliographic Metadata
by moreo.info
data
eye 1
favorite 0
comment 0
Internet Archive Bibliographic Metadata
by Allen Institute for Artificial Intelligence
data
eye 0
favorite 0
comment 0
This is a backup of the "Open Academic Search" corpus, published by Semantic Scholar / Allen Institute for AI. For more info see http://labs.semanticscholar.org/corpus/. In particular, note the terms and conditions, and the request: We request that any published research that makes use of this data cites the following paper: Waleed Ammar et al. 2018. Construction of the Literature Graph in Semantic Scholar. NAACL. ...
Internet Archive Bibliographic Metadata
by Internet Archive Web Group
data
eye 0
favorite 0
comment 0
Downloaded from: https://grid.ac/downloads
Internet Archive Bibliographic Metadata
by Internet Archive Web Group
data
eye 0
favorite 0
comment 0
This item contains bulk metadata exported from https://fatcat.wiki. With the exception of the 'abstracts' file (for which no aggregate license or copyright claims can be made; downstream users are responsible for their use), all metadata here is licensed CC-0 (public domain release) and may be used for any purpose. Downstream users are strongly encouraged to provide attribution and link here to the snapshot, as well as give credit to upstream sources (including Crossref, ORCID, DOAJ, the ISSN...